Add check_nm_dataset() for automated NONMEM dataset validation by roninsightrx · Pull Request #7 · InsightRX/irxforge

roninsightrx · 2026-04-03T19:08:32Z

Summary

Adds check_nm_dataset() — a comprehensive validation function that runs 35+ automated checks on NONMEM datasets
Returns a tidy data.frame with check name, description, and result (PASS/FAIL/NA) without throwing errors
Checks cover: column naming & reserved names, EVID/AMT/DV/MDV/CMT consistency, TIME ordering, ADDL/II overlap detection, exclusion flags, covariate missingness, numeric precision, CSV safety, and more
Includes 31 tests covering all check categories
Version bumped to 0.0.0.9005

Test plan

devtools::test(filter = "check_nm_dataset") — 31/31 passing
Manual testing with real NONMEM datasets

🤖 Generated with Claude Code

Adds a comprehensive validation function that runs 35+ checks on NONMEM datasets and returns a summary table (check/description/result) without throwing errors. Covers column naming, EVID/AMT/DV/MDV consistency, TIME ordering, ADDL/II overlap detection, reserved names, covariate missingness, and more. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR introduces a new exported validation utility, check_nm_dataset(), to run a broad suite of automated consistency/sanity checks on NONMEM-style datasets and return a tidy PASS/FAIL/NA summary table without throwing errors.

Changes:

Added check_nm_dataset() implementation with ~35+ checks and optional verbose logging.
Added a dedicated testthat file with 31 unit tests covering the new check categories.
Exported the function, generated Rd documentation, and bumped the package version.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
`R/check_nm_dataset.R`	New dataset validation function implementing the check suite and verbose reporting.
`tests/testthat/test-check_nm_dataset.R`	New unit tests validating expected PASS/FAIL behavior for each check.
`NAMESPACE`	Exports `check_nm_dataset`.
`man/check_nm_dataset.Rd`	Generated documentation for the new function.
`DESCRIPTION`	Version bumped to `0.0.0.9005`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-03T19:20:33Z

R/check_nm_dataset.R

+  has_col <- function(col) col %in% names(data)
+  cols <- toupper(names(data))
+


Column handling is inconsistent: some checks use cols <- toupper(names(data)) (case-insensitive), but has_col() and most downstream accessors use case-sensitive names(data) / data$TIME / data$EVID. This can cause required columns to be reported as present while subsequent checks are silently skipped (or covariates misclassified) when inputs use lowercase/mixed-case column names. Consider normalizing names once (e.g., map to uppercase internally) or implementing case-insensitive lookup helpers used consistently throughout the function.

Copilot · 2026-04-03T19:20:34Z

R/check_nm_dataset.R

+    max_len <- max(nchar(ids))
+    log_check(
+      "id_digits",
+      paste0("ID variable has <= ", max_id_digits, " digits (longer may need $TABLE FORMAT)"),
+      if (max_len <= max_id_digits) "PASS" else "FAIL"
+    )


max_len <- max(nchar(ids)) will return NA if any ID is missing, and the subsequent if (max_len <= max_id_digits) will error with “missing value where TRUE/FALSE needed”. Since the function promises to not throw errors, this should use na.rm = TRUE (and handle the all-NA case explicitly) or validate/flag missing IDs before computing max_len.

Suggested change

max_len <- max(nchar(ids))

log_check(

"id_digits",

paste0("ID variable has <= ", max_id_digits, " digits (longer may need $TABLE FORMAT)"),

if (max_len <= max_id_digits) "PASS" else "FAIL"

)

non_missing_ids <- ids[!is.na(ids)]

if (length(non_missing_ids) == 0) {

log_check(

"id_digits",

paste0("ID variable has <= ", max_id_digits, " digits (longer may need $TABLE FORMAT)"),

"NA"

)

} else {

max_len <- max(nchar(non_missing_ids))

log_check(

"id_digits",

paste0("ID variable has <= ", max_id_digits, " digits (longer may need $TABLE FORMAT)"),

if (max_len <= max_id_digits) "PASS" else "FAIL"

)

}

R/check_nm_dataset.R

Copilot · 2026-04-03T19:20:35Z

R/check_nm_dataset.R

+  if (has_col("ADDL") && has_col("II") && has_col("EVID") && has_col("ID") &&
+      has_col("TIME") && is.numeric(data$ADDL) && is.numeric(data$II)) {
+    overlap_found <- FALSE
+    for (id in unique(data$ID)) {
+      sub <- data[data$ID == id, ]
+      dose_idx <- which(sub$EVID %in% c(1, 4))
+      if (length(dose_idx) < 2) next
+      for (i in seq_len(length(dose_idx) - 1)) {
+        row_i <- dose_idx[i]
+        row_next <- dose_idx[i + 1]
+        addl_i <- sub$ADDL[row_i]
+        ii_i <- sub$II[row_i]
+        if (!is.na(addl_i) && !is.na(ii_i) && addl_i > 0 && ii_i > 0) {
+          last_expanded <- sub$TIME[row_i] + addl_i * ii_i
+          if (last_expanded >= sub$TIME[row_next]) {
+            overlap_found <- TRUE


addl_overlap detection iterates dosing records in their current row order (dose_idx), but overlap is a time-based concept and the dataset may be unsorted (you separately check data_ordered). As written, an unsorted dataset can yield false positives/negatives for overlap. Consider sorting sub by TIME (and possibly by additional tie-breakers) before computing overlaps, or only running this check when the ordering check passes.

tests/testthat/test-check_nm_dataset.R

Allows mapping non-standard column names to expected NONMEM names via a dictionary argument (e.g., list(ID = "SUBJ", TIME = "TAD")). All checks now resolve column names through the dictionary. Keys are case-insensitive and unknown keys produce a warning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Reviewer feedback: the MDV/DV checks were internally inconsistent. mdv_dv_consistent used only is.na(DV), but dv_leq_zero_mdv expected DV <= 0 to have MDV=1, and mdv_nonmissing_obs used DV != 0 instead of DV > 0. A valid dataset with BLQ records could not pass all three. Changes: - Unified "missing/BLQ" = is.na(DV) | DV <= 0 across all MDV checks - mdv_dv_consistent now expects MDV=1 for NA or DV<=0, MDV=0 for DV>0 - Removed redundant dv_leq_zero_mdv check (covered by updated above) - Fixed mdv_nonmissing_obs filter to use DV > 0 (matching description) - Added test for BLQ (DV <= 0) MDV consistency Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The overlap check iterated dose records in row order, which could produce false positives/negatives on unsorted datasets. Now sorts dose_idx by TIME within each ID before comparing expanded dose windows against subsequent doses. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

roninsightrx requested a review from Copilot April 3, 2026 19:16

Copilot started reviewing on behalf of roninsightrx April 3, 2026 19:17 View session

Copilot AI reviewed Apr 3, 2026

View reviewed changes

roninsightrx and others added 4 commits April 3, 2026 12:39

Update R/check_nm_dataset.R

d957d48

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

roninsightrx merged commit cb628f1 into main Apr 3, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add check_nm_dataset() for automated NONMEM dataset validation#7

Add check_nm_dataset() for automated NONMEM dataset validation#7
roninsightrx merged 5 commits intomainfrom
check-nm-dataset

roninsightrx commented Apr 3, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 3, 2026

Uh oh!

Copilot AI Apr 3, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI Apr 3, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		has_col <- function(col) col %in% names(data)
		cols <- toupper(names(data))

Conversation

roninsightrx commented Apr 3, 2026

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants